Journal of Theoretical and Applied Information Technology 
15th April 2024. Vol.102. No 7
© Little Lion Scientific 




ISSN: 1992-8645 E-ISSN: 1817-3195 


www.jatit.org


DESIGN OF RECOMMENDATION SYSTEMS USING
DEEP REINFORCEMENT LEARNING — RECENT 
ADVANCEMENTS AND APPLICATIONS 


KRISHNAMOORTHI S¹, GOPAL K. SHYAM²


¹Research scholar, Department of CSE, Presidency University, Bengaluru, India
²Professor, Department of CSE, Presidency University, Bengaluru, India
E-mail: ¹krishnamoorthis@gmail.com, ²gopalshyambabu@gmail.com


ABSTRACT 


The paradigm of recommendation systems (RS) has witnessed remarkable evolution in terms of providing
accurate recommendations to users. However, generating appropriate recommendations remains a complex
task. In this context, RS use artificial intelligence (AI) based techniques to recommend products based on
the customer's preferences. The adaptability of these techniques suffers from system complexities such as
limited data availability, changes in user preferences, and unpredictable items. This motivates researchers
to enhance the performance of RS by overcoming these problems. This review focuses on the implementation
of deep reinforcement learning (DRL) algorithms for RS. The study discusses different design aspects of RS
and summarizes DRL-based techniques applied for recommendation systems. In addition, this review analyzes
the challenges and relevant solutions based on the existing literature. This paper also discusses the open
issues of DRL and highlights the potential research directions in the RS field.

Keywords: Comparative Analysis, Deep Reinforcement Learning, Policy Optimization Algorithms,
Recommendation Systems


1. INTRODUCTION 


1.1 Background of Recommendation Systems 
Recommendation systems (RS) are considered an essential part of most
e-commerce and online systems. The evolution of social web applications
has elevated the need for online services [1]. RS models typically work
with the current interest of the user and do not emphasize long-term
preferences. This is mainly because a user's interests change frequently
over time along with their actions, preferences, and requirements. This
dynamic behavior affects the functioning of recommendation systems, and
as a result most RS are designed around the short-term interests of
users. Recently, online service platforms have been using advanced
learning-based models to sell their products by generating personalized
recommendations for users [2]. Ample research has analyzed the necessity
of developing an effective RS which can predict the interests of users.
However, it is not an easy task to generate recommendations tailored to
customers.


In this context, online platforms use artificial intelligence (AI) to
recommend or suggest products by interpreting customer preferences.
Conventional machine learning (ML) algorithms fail to interpret
large-scale data without adequate training. Deep learning-based RS, on
the other hand, are not effective in capturing interest dynamics, since
they are trained on an existing dataset which might not reflect real-time
user preferences that change rapidly. Deep reinforcement learning (DRL)
differs from machine learning and deep learning models in that an agent
learns by interacting directly with the external environment, without
requiring labeled examples for supervision. Since it learns directly from
the environment, a DRL-based model can make appropriate decisions and
manage dynamic user preferences when used for generating recommendations.
Hence attention has shifted to DRL-based recommendation systems
(DRL-based RS) for generating recommendations and improving long-term
predictions. However, the dynamic variation and uncertainty in user
preferences and interests make it complicated to suggest products. This
motivates researchers to focus on improving recommendation quality by
overcoming these problems. This research presents a detailed evaluation
of the design and application of RS using DRL.

This review article focuses on discussing the 
emerging topics, open issues, challenges, and 
research gaps observed from existing literary works 
related to DRL-based RS and provides a clear 
perspective on this advancing domain. The novel 
contributions of this review article are as follows: 

• This survey provides a comprehensive analysis of the design aspects
  and considerations of DRL-based RS, including important RL algorithms
  for their policy optimization.

• This article provides a summarization and comparative analysis of
  DRL-based RS, which includes a summary of evaluation metrics and
  detailed insights from selected journal publications.

• This survey emphasizes policy optimization algorithms such as TRPO,
  PPO, and DDPG for optimizing the RS.

• This survey presents an empirical analysis of the challenges and
  issues associated with DRL-based RS.


1.2 Research Significance 

RS has been investigated extensively in recent times, and several
researchers have conducted literature reviews on the subject. The work
presented in [3] reviewed the state of the art of cross-domain RS (CDRS),
which can generate recommendations based on the user's interests. This
work fills the gap of a systematic survey of CDRS using deep learning
algorithms. However, such studies are restricted to the evaluation of
recommendation systems using deep learning algorithms only. There is a
demand for a literature review that extends its scope beyond deep
learning algorithms. An attempt to overcome this drawback is made in [4],
which provides a systematic review of DRL for RS. That review discusses
the motivation behind applying DRL to recommendation systems, covers
different DRL-based RS, and summarizes the existing techniques. However,
it focuses only on the analysis of DRL algorithms and on the evaluation
and comparison of different DRL models, such as single-agent, multi-agent,
and hybrid models, and does not discuss the complexities associated with
the design aspects.


1.3 Research Design Considerations 


This review is designed around the significance of DRL in the design of
RS. To structure the review, relevant articles were sourced using
keywords and search strings. The articles were retrieved from publishers
and search platforms such as Elsevier, Springer, ResearchGate, and Google
Scholar, and include journal and conference papers related to the design
of RS. In addition, the keywords were formulated using Google search
trends. These sources are considered among the most valuable for
obtaining high-quality research articles and journals. The relevant
articles were sourced from these electronic databases.


The articles were filtered using multiple criteria, based on which they
were included in or excluded from the review. In the first stage of
filtration, articles related to DRL-based RS were collected from the
search engines and online databases using all of the keywords. In the
second stage, articles were retained only if they contained the keywords
and search strings. In the third stage, articles were excluded based on
year and journal of publication, i.e., papers older than 2004 were not
considered for the review. In the fourth and last stage, articles were
screened by abstract, i.e., an article was retained if its abstract was
relevant to the study and excluded otherwise. After filtering, a total of
50 papers were finalized for the review.


The criteria applied for selecting the articles helped in the precise
interpretation of the results, since the selected articles provide
valuable insights into the design aspects of DRL and its mechanism and
application in the design of RS. In addition, the research design and
selection criteria helped in providing a coherent narrative, making it
easier to identify overarching patterns and draw accurate conclusions.


The review article is further organized as follows: Section 2 presents a
brief overview of different types of DRL and their working process.
Section 3 discusses the design aspects and considerations of DRL-based
RS. Section 4 discusses the summarization and different comparative
analysis of DRL algorithms, and Section 5 outlines the challenges and
issues associated with DRL-based RS. Lastly, Section 6 concludes the
paper with value-added information extracted from the existing
literature.


2. DEEP REINFORCEMENT LEARNING (DRL)


DRL is an efficient learning-based approach which outperforms ML models
in terms of computational performance, accuracy, and ability to process
complex data patterns [5]. DRL techniques commonly employ a Q-learning
mechanism which allows the system to make decisions automatically without
requiring any previous knowledge of the environment. Q-learning is an
off-policy method: it does not depend on the policy being followed and
learns from the outcomes of the agent's actions. This allows the system
to take appropriate actions by observing the environment, and it reduces
the additional resources required for training the algorithm.


The RL algorithm uses an agent-environment interface for modeling the
reinforcement problem, as illustrated in Figure 1.

Figure 1: The agent-environment interface for modeling


In RL modeling, the agent-environment interface consists of an agent and
the environment it acts on. Here, the agent is the learner or decision
maker, and the environment is where the actions take place. At a time
step t, the agent observes the state of the environment and performs an
action based on the current state. The action is interpreted by the
interpreter, and for every action the agent receives a reward from the
environment. Typically, the RL problem is formulated as a Markov Decision
Process (MDP), defined as a tuple (S, A, R, P, γ), where S and A are the
set of all possible states and the set of actions available in each state
respectively, R signifies the reward, P is the transition probability,
and γ represents the discount factor. The objective is to maximize the
expected cumulative discounted reward, as shown in equation 1:


$$\max_{\pi} \; \mathbb{E}\left[ R(\tau) \right] \qquad (1)$$

where $R(\tau) = \sum_{t \ge 0} \gamma^{t}\, r(a_t, s_t)$ and $0 < \gamma < 1$.
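The discounted return in equation 1 can be illustrated with a few lines of Python. This is a minimal sketch rather than part of any reviewed method: the function name `discounted_return` and the click-feedback values are hypothetical, and the reward of a trajectory is assumed to be given as a plain list.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last step backwards: one multiply-add per step.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical feedback over four recommendation steps (1.0 = click).
clicks = [1.0, 0.0, 1.0, 1.0]
print(discounted_return(clicks, gamma=0.5))  # 1 + 0 + 0.25 + 0.125 = 1.375
```

A smaller γ makes the agent favor immediate clicks, while a γ close to 1 weights later feedback almost as heavily as the first step, which is the setting of interest for long-term recommendation.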


The primary components of the RL model are policy, 
reward shaping, value function, and model. 

(i) Policy: The policy in RL is denoted by π and gives the probability of
taking an action 'a' in a given state 's'. RL methods are categorized as
on-policy and off-policy methods. In on-policy methods, the RL model
evaluates and improves the same policy that is used to make decisions. On
the other hand, off-policy methods learn a policy which is different from
the policy used for generating the data.

(ii) Reward Shaping: In RL tasks, reward shaping is carried out to
provide appropriate localized advice for achieving the maximum expected
discounted reward in an MDP. Reward shaping is the process of embedding
knowledge of the environment into an RL model in order to train the
algorithm to reach accurate solutions faster.

(iii) Value function: The value function evaluates whether an action
taken by the RL model is good or bad in the long run.

(iv) Model: The model describes the behavior of the environment for a
particular state.


2.1 Algorithms for solving RL problem 

Several algorithms have been proposed in the literature for solving the
RL problem. These algorithms are categorized as tabular and approximate
methods. In tabular methods, the value functions are stored in tables.
Dynamic programming (DP), Monte Carlo (MC), and temporal difference (TD)
are some of the prominent tabular methods. DP-based methods assume that
a complete model of the environment is available and use the value
function to search for good policies. They incorporate two prominent
algorithms known as policy iteration and value iteration. MC-based
methods do not assume a model of the environment and instead need sampled
sequences of states, actions, and rewards from the environment. TD-based
methods integrate the functionalities of both the DP and MC processes:
they learn from sampled experience without a model of the environment and
update their estimates from other learned estimates. Two prominent TD
algorithms for solving RL problems are Q-learning and SARSA, which follow
the off-policy and on-policy processes respectively.

Another category of algorithms is the approximate techniques, which are
used when the state space of the environment is too large for tabular
methods. These techniques identify an appropriate approximate solution
without increasing the computational complexity and the number of
resources. Policy gradient methods are one family of approximate
solutions; they use a parameterized policy for learning and decide the
actions without depending on a value function. Among policy gradient
methods, REINFORCE and actor-critic [6] are the most prominent algorithms.
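The difference between the off-policy and on-policy TD methods mentioned above (Q-learning and SARSA) comes down to which next-step value the update bootstraps from. The following is a minimal tabular sketch under stated assumptions: the state and item names, the learning rate `alpha`, and the dictionary-based Q-table are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the best action in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action actually taken next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])


# Hypothetical two-item catalogue; Q maps (state, action) -> value.
actions = ["item_a", "item_b"]
Q_ql = defaultdict(float, {("s1", "item_a"): 1.0})
Q_sarsa = defaultdict(float, {("s1", "item_a"): 1.0})

# Same transition, but the behavior policy picks the weaker item next.
q_learning_update(Q_ql, "s0", "item_a", 0.0, "s1", actions, alpha=0.5, gamma=1.0)
sarsa_update(Q_sarsa, "s0", "item_a", 0.0, "s1", "item_b", alpha=0.5, gamma=1.0)
print(Q_ql[("s0", "item_a")], Q_sarsa[("s0", "item_a")])  # 0.5 0.0
```

Q-learning credits the state with the value of the greedy next action regardless of what the behavior policy did, whereas SARSA credits it with the value of the action that was actually selected.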


Deep learning (DL) algorithms based on neural network architectures have
attracted researchers because of their excellent performance. This strong
performance has motivated researchers to incorporate DL into traditional
RL algorithms and develop the deep Q-network (DQN), which is used as an
approximate technique for Q-learning. This idea is further extended in
another policy gradient algorithm called deep deterministic policy
gradient (DDPG), which combines DQN and the conventional deterministic
policy gradient (DPG).


2.2 Comparative analysis of DRL applications 


The applications of different DRL algorithms are summarized in this
section.


Table 1. Algorithms for problem solving in reinforcement learning models.

| References | Category | Algorithm used | Applications |
|---|---|---|---|
| [7] | Reinforcement Learning (RL) | Temporal difference (TD) (Q-Learning and SARSA) | Estimation, smooth approximation, and optimal placement |
| [8] | Reinforcement Learning (RL) | Monte Carlo (MC) | Real-time online learning, enhancement of the convergence speed and performance of DQN |
| [9] | Reinforcement Learning (RL) | Dynamic programming (Value/policy iteration) | Solving decision making problems and convergence problems |
| [10] | Deep Reinforcement Learning (DRL) | Q-Learning (DQN) | Achieving accurate forecasting, reducing computation complexity and improving state representation |
| [11] | Deep Reinforcement Learning (DRL) | Actor-Critic | To learn a safe and non-conservative policy, to mitigate model bias, and to reduce the physical interaction with external environment |
| [12] | Deep Reinforcement Learning (DRL) | REINFORCE | To achieve sub-optimal policy in two-stage recommendation systems, and analyzing user profile for personalized recommendations |


Table 2. Summarization of different DRL algorithms

| Reference | RL algorithm | State | Action | Reward | Evaluation method |
|---|---|---|---|---|---|
| [13] | Q-Learning | K-tuples | Approximation of Bayesian Interface | Prediction accuracy | Offline method |
| [14] | SARSA | Current resource consumption state | Task offloading and resource allocation | Smaller response time | On-line policy learning |
| [15] | SARSA(λ) | Series of discrete time steps | Interaction between the agent and environment | High learning efficiency | On-line policy learning |
| [16] | Value Iteration | Input/Output (IO) data | Closed loop control action | Optimal control MDP | Optimal policy |
| [17] | Policy Iteration | A set of variables obtained from interaction with environment | Controlling RL components | A possible reward for optimization problem | Offline training |
| [18] | Q-Learning & SARSA | Flappy Bird game | Performing "flap" and "do not flap" action | Positive reward for continuous action | Offline simulation and online |


The emergence of DRL models has been a turning point in the design of
recommendation systems. One of the outstanding abilities of DRL
algorithms is handling complex, high-dimensional state and action spaces,
which are common in RS. As inferred from existing works, DQN is
considered to be the most popular algorithm. DQN extends traditional
Q-learning by employing experience replay for updating the weights in the
training phase, which reduces the computational complexity and mitigates
the effect of error derivatives. These factors help DQN achieve higher
stability than Q-learning. However, certain limitations restrict the
adaptability of DQN. Overestimation of action values in certain
situations makes learning inefficient and can result in suboptimal
policies. To overcome this problem, the DDQN algorithm is used in various
recommendation systems [19]. Another limitation of DQN is that it samples
experiences at random, irrespective of their significance, which affects
the speed of the learning process. This problem is alleviated by policy
gradient algorithms, which do not need a value function for
approximation. The most popular policy gradient methods are actor-critic
and REINFORCE. These methods update the policy weights directly. However,
they suffer from high variance and slow learning.
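The two DQN ingredients discussed above, uniform experience replay and the overestimation-prone target, can be sketched without any deep learning framework. The sketch below assumes that DDQN refers to double DQN, where the online network selects the next action and the target network evaluates it; the class and function names are hypothetical, and the Q-values are passed in as plain lists rather than computed by a network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling ignores how informative each transition is.
        return random.sample(list(self.buffer), batch_size)


def dqn_target(r, q_target_next, gamma=0.9):
    """DQN target: the same values both select and evaluate the action."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma=0.9):
    """Double DQN target: online values select, target values evaluate."""
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[best]


# Hypothetical next-state Q-values for two candidate items.
print(dqn_target(1.0, [0.2, 0.8], gamma=0.5))
print(double_dqn_target(1.0, [0.9, 0.1], [0.2, 0.8], gamma=0.5))
```

In this example the double DQN target is lower than the DQN target because the action preferred by the online values is evaluated with the target values instead of taking the maximum directly, which is the mechanism that counteracts overestimation.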


3. DESIGN ASPECTS OF DRL-BASED RS 


The design of DRL-based RS is discussed using 
three main stages: phases in RS, types of RS, and 
optimization of RL algorithms for RS. 


3.1 Phases involved in RS 

For recommending any item or product, the RS uses three important phases,
which are discussed below:

(i) Information Collection Phase: In general, the recommendation system
extracts a huge amount of information from large-scale datasets. The
information collected is related to user interests, preferences, and
ratings.

(ii) Learning Phase: The DRL algorithm is applied to the data collected
in the previous phase, which is processed by filtering out irrelevant and
redundant information. Further, the user's features are used for
generating relevant recommendations.

(iii) Prediction or Recommendation Phase: In this phase, the DRL
algorithm is trained to predict items or products based on the observed
activities, preferences, and interests of the user.


The learning and prediction phases together help the DRL algorithm to
generate relevant recommendations. However, the learning process differs
for each RL algorithm, and on this basis DRL-based RS are broadly
categorized into model-based (MB) and model-free (MF) methods.
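The three phases described above can be made concrete with a deliberately simple pipeline. This is only an illustrative sketch: it replaces the DRL learner with an incremental mean of observed ratings (a stand-in value estimate), and the function names, the log format of (user, item, rating) rows, and the sample data are all hypothetical.

```python
from collections import defaultdict


def collect(log):
    """Information collection: keep well-formed (user, item, rating) rows."""
    return [row for row in log if len(row) == 3 and row[2] is not None]


def learn(interactions):
    """Learning: incremental mean rating per (user, item) pair."""
    value, count = defaultdict(float), defaultdict(int)
    for user, item, rating in interactions:
        key = (user, item)
        count[key] += 1
        value[key] += (rating - value[key]) / count[key]
    return value


def recommend(value, user, items, k=1):
    """Prediction: rank candidate items by estimated value for the user."""
    ranked = sorted(items, key=lambda i: value.get((user, i), 0.0), reverse=True)
    return ranked[:k]


log = [("u1", "a", 5), ("u1", "b", 2), ("u1", "a", 3), ("u1", "c", None)]
value = learn(collect(log))
print(recommend(value, "u1", ["a", "b", "c"], k=2))  # ['a', 'b']
```

A DRL-based RS would replace `learn` with an agent that updates a policy or value network from interaction feedback, but the separation into collection, learning, and prediction stays the same.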


3.2 Types of DRL-based RS 

3.2.1 Model-based Methods: 

The MB approaches employ a learned model for adapting to new tasks in
complex environments and in real-time scenarios. In model-based methods,
if the dynamics of the system $p(x_{t+1} \mid x_t, u_t)$ are known, or
can be approximated using a learned model $\hat{p}(x_{t+1} \mid x_t, u_t)$,
they can be used for optimal control. For a linear dynamics function and
quadratic rewards, the functions $Q(x_t, u_t)$ and $V(x_t)$ define the
action-value function and value function respectively. These functions
are computed using dynamic programming.
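For the linear-dynamics, quadratic-reward case mentioned above, the dynamic programming computation has a closed form. The sketch below works through the scalar case under explicit assumptions that are not taken from the reviewed papers: dynamics $x_{t+1} = a x_t + b u_t$, reward $-(q x_t^2 + r u_t^2)$, and a value function of the form $V_t(x) = -p_t x^2$; the function name `scalar_lqr` is hypothetical.

```python
def scalar_lqr(a, b, q, r, horizon):
    """Backward dynamic programming for a scalar linear-quadratic problem.

    Returns the feedback gains k_t (optimal action u_t = -k_t * x_t),
    ordered from the first step to the last, and the coefficient p of the
    value function V(x) = -p * x**2 at the first step.
    """
    p = q  # terminal step: only the state cost remains
    gains = []
    for _ in range(horizon):
        denom = r + b * b * p
        k = a * b * p / denom                       # minimizer of the Q-function in u
        p = q + a * a * p - (a * b * p) ** 2 / denom  # value after minimizing over u
        gains.append(k)
    gains.reverse()  # computed backwards in time, returned forwards
    return gains, p


gains, p0 = scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0, horizon=2)
print(gains, p0)
```

Each backward step minimizes the quadratic action-value function over the action and substitutes the minimizer back in, which is how the value function and the action-value function are obtained from one another.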

Model-based approaches have been broadly researched, and several
publications have discussed their effectiveness; these are listed in
Table 3.


Table 3. Model-based DRL for recommendation systems 


| Method | Reference |
|---|---|
| Value-based Method | [20] |
| Policy-based Method | [21] |
| Hybrid Method | [22] |


• Challenges/Issues associated with Model-based DRL for RS

It is inferred from existing studies that the performance of model-based
DRL for RS deteriorates when handling large-scale data. In value-based
methods, exploration is necessary to learn the stochastic values of the
system. Models with learned dynamics are affected by local minima more
adversely than models using ground-truth dynamics. Besides, the
prediction or recommendation error increases with time for unknown
conditions. Hybrid methods, which integrate the attributes of both
value-based and policy-based methods, require more exploration in terms
of their adaptability to recommendation systems.

3.2.2 Model-free Methods: 
The MF approaches are more efficient in learning complicated
environments, but they require a greater number of iterations to converge
and can get stuck in local minima. These algorithms help to solve complex
and high-dimensional problems. Also, model-free algorithms avoid the need
to learn a model of the environment, as required by model-based
algorithms. However, they require a relatively large number of samples.
To address this, off-policy algorithms employ Q-function approximation
for obtaining superior data efficiency. The Q-function
$Q^{\pi}(x_t, u_t)$ for a given policy $\pi$ is the expected return from
$x_t$ after performing action $u_t$ and following the policy $\pi$
thereafter. The Q-function is defined as follows:


$$Q^{\pi}(x_t, u_t) = \mathbb{E}_{r_{i \ge t},\; x_{i > t} \sim E,\; u_{i > t} \sim \pi}\left[ R_t \mid x_t, u_t \right] \qquad (2)$$


Q-learning learns a greedy deterministic policy
$\mu(x_t) = \arg\max_{u} Q(x_t, u_t)$, which corresponds to the policy
$\pi(u_t \mid x_t) = \delta(u_t = \mu(x_t))$.

The model-free approaches for DRL-based RS are summarized in Table 4.


Table 4. Model-free DRL for recommendation systems 


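| Method | Technique used | Reference |
|---|---|---|
| Value-based Method | Vanilla DQN | [23] |
| Value-based Method | Appropriate state and action optimization | [24] |
| Value-based Method | DQN with graph/image input | [25] |
| Value-based Method | DQN for joint learning | [26] |
| Policy-based Method | Vanilla REINFORCE | [27] |
| Policy-based Method | REINFORCE uses graph structure/input | [28] |
| Policy-based Method | Non-REINFORCE model | [29] |
| Hybrid Method | Vanilla DDPG | [30] |
| Hybrid Method | DDPG with knowledge graph | [31] |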


Policy-based methods have a learning advantage over value-based methods,
since they do not need a value function for learning. In addition,
policy-based methods are simpler and more deterministic than value-based
methods. Hybrid methods, which combine the advantages of both
policy-based and value-based methods, are also gaining significance
because of their ability to solve continuous action problems. It can be
inferred that DDPG with a knowledge graph performs better than vanilla
DQN, DQN with graph input, and DQN for joint learning. This is mainly
because DDPG is an actor-critic model that learns the policy directly
from the model parameters, whereas DQN learns the Q-values which define
the policy. However, training DDPG can be challenging because of its
unstable learning characteristics. In addition, DDPG is computationally
intensive compared to DQN. This supports the effectiveness of DQN in
deterministic tasks. However, it has not been practically proved that DQN
performs better than DDPG in discrete tasks. This aspect requires more
exploration and validation.


Very few works on DRL-based RS have focused on policy-based methods and
hybrid methods such as DDPG. Most survey papers have focused on value
function approaches, MDP, knowledge graphs, etc. This work emphasizes
policy optimization methods to address this research gap.


3.3 Recommendation Policy optimization using 
RL algorithms 

In general, DRL-based RS suffer from certain unique problems such as
reward estimation, state construction, and simulation of external
environments. To overcome these problems, DRL algorithms are optimized
using on-policy and off-policy optimization algorithms. Three important
policy optimization algorithms are discussed in this section. The DDPG
algorithm can overcome the drawbacks of DQN in terms of better control
performance, while TRPO improves on the performance of DDPG by ensuring a
long-term reward. PPO further optimizes TRPO by transforming the
surrogate objective function, which enhances the efficiency and reduces
the computational complexity.

3.3.1 Trust Region Policy Optimization (TRPO) 
Algorithm: 

TRPO is one of the prominent on-policy reinforcement learning techniques,
in which the policy is updated using the data generated by the current
policy. The TRPO algorithm is mainly used for optimizing large nonlinear
policies such as neural networks. It assures theoretical monotonic
improvement and estimates the gradient of the expected return
$\nabla_{\theta}\, \eta(\pi_{\theta})$ using the likelihood ratio, as
shown in the equation below:


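$$\nabla_{\theta}\, \eta(\pi_{\theta}) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{i} \mid s_t^{i}) \left( R_t^{i} - b_t^{i} \right) \qquad (3)$$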


where N is the number of iterations, T is the total number of steps
involved in every iteration,
$R_t^{i} = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}^{i}$ is the cumulative
reward, and $b_t^{i}$ is a baseline that reduces the variance of the
estimate. TRPO constrains the change in the policy distribution at every
update, which underlies its improvement guarantee.
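The likelihood-ratio estimator in equation 3 can be sketched for a small softmax policy. The code is an illustration under stated assumptions, not the TRPO algorithm itself: it omits the trust-region constraint, uses a state-independent softmax over a handful of actions, and takes the per-step mean return across trajectories as the baseline. All function names are hypothetical.

```python
import math


def reward_to_go(rewards, gamma):
    """R_t = sum over t' >= t of gamma**(t' - t) * r_t' for one trajectory."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def softmax(theta):
    m = max(theta)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient(theta, trajectories, gamma=0.99):
    """Likelihood-ratio gradient with a baseline, averaged over N*T samples.

    trajectories: list of (actions, rewards) pairs, all of the same length T.
    """
    probs = softmax(theta)
    returns = [reward_to_go(r, gamma) for _, r in trajectories]
    n, T = len(trajectories), len(returns[0])
    # Assumed baseline: mean return at each time step across trajectories.
    baseline = [sum(ret[t] for ret in returns) / n for t in range(T)]
    grad = [0.0] * len(theta)
    for (actions, _), ret in zip(trajectories, returns):
        for t, a in enumerate(actions):
            weight = ret[t] - baseline[t]
            for k in range(len(theta)):
                # d/d theta_k of log softmax(theta)[a] = 1[k == a] - probs[k]
                grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * weight
    return [g / (n * T) for g in grad]


# Two one-step trajectories: action 0 was rewarded, action 1 was not.
print(policy_gradient([0.0, 0.0], [([0], [1.0]), ([1], [0.0])], gamma=1.0))
```

The resulting gradient is positive for the rewarded action and negative for the other, so a gradient ascent step shifts probability toward the action with the higher return.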

3.3.2 Proximal Policy Optimization (PPO) 
Algorithm: 

Like TRPO, PPO estimates the ascent direction of the policy gradient of
the expected return, while limiting the change in the policy at each
update to small values. Policy gradient methods operate by computing an
estimator of the policy gradient and plugging it into a stochastic
gradient ascent algorithm. The gradient estimator is defined as shown in
the equation below:


$$\hat{g} = \hat{\mathbb{E}}_t\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t \right] \qquad (4)$$


where $\pi_{\theta}$ is the stochastic policy and $\hat{A}_t$ is an
estimator of the advantage function at time step t. The term
$\hat{\mathbb{E}}_t$ denotes the empirical average over a finite batch of
samples, in an algorithm that alternates between sampling and
optimization. Implementations of PPO that use an automatic
differentiation mechanism construct an objective function whose gradient
is the estimator $\hat{g}$, as shown in equation 5.


$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[ \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t \right] \qquad (5)$$


A PPO algorithm which uses trajectory segments of fixed length is given
below. In every iteration, each of N parallel actors collects T time
steps of data, and the surrogate loss is computed on these NT time steps
of data and optimized using the stochastic gradient descent (SGD)
algorithm.


PPO Algorithm:

Start:
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy $\pi_{\theta_{old}}$ in environment for T time steps
        Compute advantage estimates $\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_T$
    end for
    Optimize surrogate L wrt $\theta$, with K epochs and mini batch size $M \le NT$
    $\theta_{old} \leftarrow \theta$
end for
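The surrogate that the PPO loop optimizes can be written down directly from per-sample log-probabilities and advantage estimates. The sketch below shows the objective of equation 5 and, as an assumption beyond what this section states, the clipped surrogate from the original PPO formulation with a clipping range `eps`; the function names and sample numbers are hypothetical.

```python
import math


def pg_surrogate(log_probs, advantages):
    """L^PG: empirical mean of log pi_theta(a_t | s_t) * A_t."""
    return sum(lp * a for lp, a in zip(log_probs, advantages)) / len(advantages)


def clipped_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate: mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The new policy doubles the probability of both sampled actions.
old_lp = [0.0, 0.0]
new_lp = [math.log(2.0), math.log(2.0)]
print(clipped_surrogate(new_lp, old_lp, advantages=[1.0, -1.0], eps=0.2))
```

With a positive advantage the gain from raising the probability ratio is capped at 1 + eps, while with a negative advantage the unclipped penalty is kept, which is how the objective discourages large policy changes within the K optimization epochs.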


3.3.3 Deep Deterministic Policy Gradient (DDPG) 
Algorithm: 

Unlike on-policy optimization algorithms, off-policy techniques allow
learning from all the available data, including data generated by random
and inconsistent policies. This improves the efficiency of off-policy
learning algorithms relative to on-policy processes. One such efficient
off-policy technique is the DDPG algorithm, which is used for
constructing an MF-based RL model.


The DDPG algorithm is an actor-critic reinforcement learning approach. It
builds on the policy gradient theorem, which for a stochastic policy
$\pi(s, a; \theta)$ is stated as shown in the equation below:


$$\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta} = \alpha \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta}\, Q^{\pi}(s, a) \qquad (6)$$


where $\alpha$ is the positive step size and $d^{\pi}$ denotes the
discounted weighting of the states encountered when starting at $s_0$.
The actor-critic relationship in DDPG is expressed as follows. The actor
$\pi(s \mid \theta^{\pi})$ is a network with weights $\theta^{\pi}$ that
depends on the present state s of the environment, and the critic is
another network denoted by $Q(s, a \mid \theta^{Q})$. The critic is
updated using the Bellman equation, as shown below:

$$Q(s_t, a_t) = \mathbb{E}_{r_t,\; s_{t+1}}\left[ r(s_t, a_t) + \gamma\, Q(s_{t+1}, \pi(s_{t+1})) \right] \qquad (7)$$


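The actor is updated using the chain rule, and the weights
$\theta^{\pi}$ are updated along the gradient of the objective J, as
shown below:


$$\nabla_{\theta^{\pi}} J = \mathbb{E}_{s}\left[ \nabla_{\theta^{\pi}} Q(s, \pi(s \mid \theta^{\pi}) \mid \theta^{Q}) \right] \qquad (8)$$
$$= \mathbb{E}_{s}\left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a = \pi(s \mid \theta^{\pi})}\, \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi}) \right] \qquad (9)$$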


The algorithm for the DDPG mechanism is given 
below. 


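Algorithm: DDPG algorithm:

Start:
Initialization
Input weights $\theta^{Q}$ of the critic $Q(s, a \mid \theta^{Q})$ and $\theta^{\pi}$ of the actor $\pi(s \mid \theta^{\pi})$
Target weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\pi'} \leftarrow \theta^{\pi}$
Replay buffer R
for each process do
    Initialize random process $\mathcal{N}$ to explore state actions
    Observe the initial state
    for each step t of process do
        Select action $a_t = \pi(s_t \mid \theta^{\pi}) + \mathcal{N}_t$
        Execute $a_t$ and obtain reward $r_t$ and state $s_{t+1}$
        Store state transition $(s_t, a_t, r_t, s_{t+1})$ in R
        Sample a minibatch of N transitions $(s_i, a_i, r_i, s_{i+1})$ stored in R
        Set $y_i = r_i + \gamma\, Q'(s_{i+1}, \pi'(s_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'})$
        Update the critic by minimizing the loss
            $L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$
        Update the actor using the sampled policy gradient
            $\nabla_{\theta^{\pi}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \pi(s_i)}\, \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi}) \big|_{s_i}$
        Update the target networks:
            $\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'}$
            $\theta^{\pi'} \leftarrow \tau\, \theta^{\pi} + (1 - \tau)\, \theta^{\pi'}$
    end for
end for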
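Three steps of the DDPG loop are simple enough to show without a neural network library: the critic target, the mean squared critic loss, and the soft update of the target weights. This is a minimal sketch with hypothetical function names; weights are represented as flat lists of floats, and the Q-values are assumed to be supplied by the critic and target critic.

```python
def critic_target(r, q_target_next, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1})), with Q' given as a number."""
    return r + gamma * q_target_next


def critic_loss(targets, q_values):
    """Mean squared error between targets y_i and critic outputs Q(s_i, a_i)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_weights, online_weights, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


# Hypothetical minibatch of two transitions.
targets = [critic_target(1.0, 2.0, gamma=0.5), critic_target(0.0, 4.0, gamma=0.5)]
print(targets, critic_loss(targets, q_values=[1.0, 1.0]))
print(soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1))
```

A small `tau` makes the target networks trail the learned networks slowly, which keeps the targets $y_i$ from shifting abruptly between updates.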


Several researchers have discussed the prominence of the TRPO, PPO, and
DDPG algorithms described above. Table 5 summarizes the application of
these algorithms to different RS.


It can be inferred from the literature review that PPO and TRPO both
achieve better performance than other gradient-based methods. However,
PPO outperforms TRPO in terms of fine-tuning the DRL parameters,
consistency in solving policy iterations, and click-through rate (CTR).
In addition, both algorithms can be trained faster and with fewer
parameters for generating recommendations. On the other hand, DDPG
achieves better performance and stability than DQN. Due to its learning
ability, DDPG is more suitable for generating automatic recommendations,
with better normalized discounted cumulative gain (NDCG) than other deep
learning-based models. The literature review also reveals that TRPO, PPO,
and DDPG yield better results than value-based methods and neural network
models. This calls for deeper investigation of these algorithms for RS.


Table 5. Policy optimization algorithms for different RS 


| Reference | Algorithm used | Application | Evaluation Metrics and Values | Observation |
|---|---|---|---|---|
| [32] | PPO | E-Commerce | CTR = 2.843% | Improvement in CTR rate compared to model-based and model-free methods |
| [33] | TRPO | Online learning systems | AUC = 0.739 | Prediction capacity increased 3 times compared to LSTM and context-aware models |
| [34] | DDPG | Page-wise recommendations | F1-score = 80.5% and NDCG = 0.1872 | Automatic learning of recommendation strategy compared to deep CNN and GRU |
| [35] | DDPG | Interactive RS | NDCG = 0.8 and Precision = 0.45 | Improvement of 40% and 30% in terms of precision and NDCG is observed |
| [36] | TRPO | E-Commerce | CTR = 0.33 (at 30k step) | TRPO requires less training time than DQN and achieves better performance in terms of CTR and single item recommendations |
| [37] | PPO | Online simulated environment named Virtual Taobao | CTR = 0.37 | PPO achieves better CTR with stable performance compared to DDPG |

4. SUMMARIZATION OF DRL-BASED TECHNIQUES
APPLIED FOR RECOMMENDATION SYSTEMS


This section summarizes and compares DRL-based RS.
Table 6 compares the DRL algorithms that different
researchers have proposed for RS, along with their
advantages and limitations.


Table 6. Overview of DRL-based recommendation approaches 


Ref | Recommendation Method | Advantages | Limitations
[38] | Collaborative filtering | High recommendation accuracy and faster convergence | Suffers from state space problems and requires high computation time
[39] | Collaborative filtering | Provides better results for large-scale user databases; no prior knowledge is required; improved recommendation time | Data sparsity and cold-start problems while recommending new items
[40] | Learning based recommendation | Performs feature interaction modeling for identifying user interaction | The model capacity becomes limited after reaching peak performance
[41] | Long-term recommendation | Maximizes recommendation accuracy with high hit rate and NDCG | The performance of the RL algorithm is affected by sampling efficiency problems
[42] | Cold-start recommendation | Overcomes the cold-start problem in RS and achieves high CTR and NDCG | The proposed approach focuses only on cold-start problems and is not designed for lifelong recommendation
[43] | Learning based recommendation | The model shows an improved convergence rate and is able to compute complicated and high-dimensional instances | The model is generic and underperforms while generating application-specific recommendations
[44] | Cross-platform recommendation systems | Overcomes the cold-start, gray sheep, and data sparsity problems | Reduced scalability and increased computation cost are observed as the diversity of social networks and the number of users increase
[45] | Shared-account Cross-domain Sequential Recommendation (SCSR) | The SCSR is advantageous compared to cross-platform RS since it considers the characteristics of both the shared-account and cross-domain settings while generating recommendations for shared accounts | The SCSR model assumes that all accounts are shared by the same users, which is not appropriate for real-time applications. In real-time shared accounts, both the identity and the number of the latent users are not known.


4.1 Performance Evaluation of DRL-based RS 
The performance of DRL-based RS is
evaluated using both offline and online evaluation
methods [45]. In offline evaluation, the efficacy of
the DRL-based RS is measured on a fixed dataset
using metrics such as RMSE, MAE, precision, recall,
F-measure, AUC, NDCG, MAP, hit rate, and
perplexity. In online evaluation, the DRL-based RS
is tested for its ability to learn while interacting
with the environment; CTR and bounce rate are the
two main metrics used to determine online
performance. The different performance metrics used
for DRL-based RS are described in Table 7.


Table 7. Summary of evaluation metrics for evaluating DRL-based RS 


Reference | Evaluation Metric | Description | Mathematical Formula
Offline Evaluation
[45] [46] | Root Mean Squared Error (RMSE) | RMSE is used in predictive and regression analysis. It is equal to the square root of the MSE. | $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
[47] | Mean Absolute Error (MAE) | MAE is calculated as the average of the absolute difference between the predicted and actual values. | $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \hat{x}_i \rvert$, where $x_i$ and $\hat{x}_i$ are the actual and predicted values.
[46] | Precision | Precision is defined as the accuracy of positive predictions. It is also defined as the proportion of accurately identified data which is relevant. | $\mathrm{Precision} = \frac{TP}{TP + FP}$, where TP and FP are true positive and false positive respectively.
[47] | Recall | Recall is determined as the ratio of the recommendations that are accurately classified. | $\mathrm{Recall} = \frac{TP}{TP + FN}$, where TP and FN are true positive and false negative respectively.
[47] | F-measure | F-measure is determined as the harmonic mean of precision and recall. | $F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
[46] | Area Under the Curve (AUC) | AUC is the ability of the RL model to distinguish different samples in a particular dataset. | NA
[47] | Normalized Discounted Cumulative Gain (NDCG) | The NDCG measure is used for ranking the products based on their accuracy. | $\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$
[47] | Mean Average Precision (MAP) | MAP is computed as the mean of the average precision value based on the relevance score. | $\mathrm{mAP} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{AP}_i$, where AP is the average precision.
[48] | Hit rate | The total number of products in the samples that are also present in the user list is defined as the number of hits. The number of hits defines the hit rate. | NA
[56] | Perplexity | This metric is used for evaluating the topical models which can measure the quality of the recommended items. | $PP(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}$, where $w_1, w_2, \ldots, w_N$ represent the words.
Online Evaluation
[49] | Click Through Rate (CTR) | CTR is the measure of the recommendations that are clicked by the user. | $\mathrm{CTR} = \frac{\text{Total no. of clicks}}{\text{Total no. of impressions}} \times 100$
[50] | Bounce Rate | Bounce rate is defined as the total percentage of users who verify the recommendation list, do not explore the recommendations, and exit the system. | $\text{Bounce rate} = \frac{\text{Single page visits}}{\text{Total website visits}}$
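
As an illustration of the offline metrics in Table 7, the following minimal Python sketch computes RMSE, MAE, precision, recall, and F-measure. The function names and the toy ratings and item lists are hypothetical and are not taken from the reviewed works. Precision and recall are computed here over sets of recommended and relevant items, which is one common convention and an assumption of this sketch.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall_f(recommended, relevant):
    """Precision, recall, and F-measure for one recommendation list."""
    rec, rel = set(recommended), set(relevant)
    tp = len(rec & rel)   # recommended and relevant
    fp = len(rec - rel)   # recommended but not relevant
    fn = len(rel - rec)   # relevant but not recommended
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data (assumed values, for illustration only).
ratings_actual = [4.0, 3.0, 5.0, 2.0]
ratings_predicted = [3.0, 3.0, 4.0, 2.0]
error_rmse = rmse(ratings_actual, ratings_predicted)
error_mae = mae(ratings_actual, ratings_predicted)
p, r, f = precision_recall_f(["a", "b", "c", "d"], ["a", "c", "e"])
```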


Hit rate and NDCG are the two main metrics used to
evaluate DRL-based RS that generate long-term
recommendations. For real-time evaluation, CTR is
useful since it directly measures the share of
recommendations that users click.
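
The three metrics highlighted above can be sketched as follows. This is a minimal illustration under stated assumptions: DCG uses the common log2(rank + 1) discount, and hit rate is taken as the fraction of users whose held-out item appears in their recommendation list. Table 7 gives no formula for hit rate, so this definition, the function names, and the toy inputs are assumptions and not the method of any reviewed work.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def hit_rate(recommended_lists, held_out_items):
    """Fraction of users whose held-out item is in their list."""
    hits = sum(1 for recs, item in zip(recommended_lists, held_out_items)
               if item in recs)
    return hits / len(held_out_items)


def ctr(clicks, impressions):
    """Click through rate as a percentage."""
    return 100.0 * clicks / impressions


# Toy inputs (assumed values, for illustration only).
score_ideal = ndcg([3, 2, 1])
score_reversed = ndcg([1, 2, 3])
hr = hit_rate([["a", "b"], ["c", "d"]], ["a", "x"])
rate = ctr(37, 100)
```

A list that is already sorted by relevance yields an NDCG of 1, and any other ordering of the same items scores lower.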


5. CHALLENGES AND ISSUES IN DRL- 
BASED RS 


Despite these advantages, certain challenges and
issues need to be addressed. Existing DRL-based RSs
have been found to face accuracy-related, cold-start,
and sparsity issues when dealing with large-scale
data. Apart from these traditional issues, several
other challenges in these systems need significant
attention. Some of the prominent research problems
identified in this work are listed below:


(i) Sample Efficiency: Sample inefficiency is one of
the crucial issues in model-free (MF) DRL methods.
The number of samples required to train MF-DRL is
significantly high, since the agent interacting with
the external environment requires sufficient training
data to perform actions. The sample efficiency of
model-based (MB) DRL is comparatively higher.
However, MB-DRL systems are more complex since the
agent is forced to learn both the policy and the
external environment.
(ii) Data Dimensionality: Since recommendation
systems have to deal with complex and large-scale
data, they often suffer from class imbalance and high
data dimensionality, which deteriorate the
recommendation accuracy of the DRL models.

(iii) Optimization of DRL: Optimizing the goals of
DRL for multi-objective RS is complex, since it
involves ensuring fair recommendation generation and
maintaining diversity while coping with increased
computational cost.

(iv) Biased and Sparse feedback data: In most RS, the
feedback is usually biased. In DRL-based RS, the
feedback is sampled based on the interaction between
the policy and the environment. Though off-policy
training can enhance the policy based on the biased
feedback data, the sample efficiency is reduced
significantly. Besides, the RS is designed to handle
millions of users with indistinct choices, which makes
the DRL user state space more complex to interpret.
(v) Online deployment: Deploying DRL-based RS
to handle multiple scenarios and multiple
customers at a time can be challenging. Training
DRL models while achieving high relevance and
CTR is crucial and requires robust training. This
issue can be resolved by deploying an online training
model. However, online deployment is still in its
infancy and requires deeper investigation.
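
Regarding the sample efficiency and off-policy training issues listed above, off-policy methods such as DQN and DDPG commonly store past interactions in an experience replay buffer so that each sample can be reused in several updates. The following minimal Python sketch shows such a buffer. The class name `ReplayBuffer`, its capacity, and the toy transitions are hypothetical, and the sketch is not taken from any reviewed work.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from stored transitions.
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


# Toy usage (assumed values): store five transitions in a buffer of
# capacity three, then draw a mini-batch of two for a training step.
buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.add(step, "recommend_item", 1.0, step + 1)
batch = buf.sample(2)
```

Uniform sampling from logged interactions does not remove the feedback bias described in point (iv); it only allows the collected samples to be reused.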


5.1 Assumptions and Limitations 

Although this review paper attempts to
cover the design aspects, challenges, and issues
related to DRL-based RS, it rests on certain
assumptions and has certain limitations.
5.1.1 Assumptions 

• This review generalizes the findings
obtained across different recommendation
systems. However, the interpretation of
findings can vary based on the domain and
application area.

• It is assumed while formulating this review
that the algorithms and models discussed
remain stable over time. Any small change
or update in the model parameters can have
a significant influence on the performance
and accuracy of the RS.

5.1.2 Limitations 

• The study mainly emphasizes the design of
RS using DRL and their performance
evaluation. Limiting the scope in this way
may leave research gaps that call for further
relevant work.

• Considering the rapid evolution of
technologies related to the design of RS, this
review might become less informative over
time and require frequent updates.

• Security, one of the most important factors
related to RS, is not emphasized in this
review.


6. CONCLUSION 


This review comprehensively analyzes the concept
of RS with an emphasis on deep reinforcement
learning (DRL) algorithms. The study discussed
different DRL methods and compared them with
respect to their state, action, reward, dataset used,
evaluation method, and evaluation metrics.
The study also discussed the design aspects of DRL-
based RS and their policy optimization algorithms.
The analysis of the TRPO, PPO, and DDPG algorithms
shows that both TRPO and PPO achieve better
recommendation performance in terms of prediction
capacity, CTR, and learning ability. The DDPG
algorithm exhibits better stability and
recommendation performance than value-based
methods. These policy optimization algorithms can be
trained faster with limited parameters and are hence
considered more appropriate for optimizing RS. More
focus is required on the analysis of policy
optimization methods for optimizing the performance
of RS. It can be inferred from the review that DRL
algorithms are highly significant in the field of RS
and that the optimization of RS needs deeper
investigation. The study also points out that data
dimensionality can have a negative impact on the
accuracy of DRL-based RS; this issue can be resolved
by integrating a feature extraction approach with DRL
algorithms. RS are also susceptible to data bias,
which, together with data dimensionality, can pose a
grave threat to the RS, so solving these problems
remains a foremost challenge. It is expected that this
review can provide a roadmap for researchers to
understand DRL and policy optimization algorithms
and their problems, and hence offer valuable insight
to those carrying out research in this field.